Dataset in use: TMDb movie data
I will try to answer the following questions
- Which genres are most popular from year to year?
- What kinds of properties are associated with movies that have high revenues?
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
% matplotlib inline
df = pd.read_csv('tmdb-movies.csv')
df.head(2)
Observation: We need to seperate genres data to make sure we can provide accurate statistics
df.describe()
df.dtypes
df.info();
sum(df.duplicated())
df.nunique()
df['release_year'].unique()
Check for popularity outlier since we will use it as our dependent variable during the analysis. It looks like we have three points on the right that need to be removed.
sns.boxplot(x=df['popularity']);
df.drop(['homepage', 'tagline', 'keywords', 'production_companies'], axis=1, inplace=True)
df.drop_duplicates(inplace=True)
sum(df.duplicated())
df.dropna(inplace=True)
df.shape
Removing popularity outlier
df = df[df.popularity < 20]
df.shape
df_genres = df.copy()
df_genres[~df_genres['genres'].str.contains('|')]
hb_genres_series = df_genres['genres'].str.split('|').apply(pd.Series).stack().reset_index(drop=True).unique()
hb_genres_series.size
Count the number of movies per genres
genre_year = pd.DataFrame(data=None,index=hb_genres_series,columns=df_genres['release_year'].sort_values(ascending=False).unique())
genre_year.fillna(value=0, inplace=True)
genre_year_popularity = genre_year.copy()
def count_genres(row):
for genre in row['genres'].split('|'):
genre_year.loc[genre, row['release_year']] += 1
df_genres.apply(count_genres, axis = 1)
genre_year.head()
genre_year['Max'] = genre_year.idxmax(axis=1)
genre_year['Max']
Sum up the popularity for genre
def genres_popularity(row):
for genre in row['genres'].split('|'):
genre_year_popularity.loc[genre, row['release_year']] += row['popularity']
df_genres.apply(genres_popularity, axis = 1)
genre_year_popularity.head()
genre_year_popularity['Max'] = genre_year_popularity.idxmax(axis=1)
genre_year_popularity['Max']
df.corr()
Revenue has a positively correlation with vote_count, budget, and popularity
df.plot(x='revenue', y='vote_count', kind='scatter');
df.plot(x='revenue', y='budget', kind='scatter');
The popularity of genre listed by year
fig, ax = plt.subplots(10, 6)
fig.set_figheight(35)
fig.set_figwidth(30)
for col in range(0, len(genre_year_popularity.columns) - 1):
genre_year_popularity.plot(y=genre_year_popularity.columns[col], kind="bar", ax=ax[int(col / 6), col % 6])
df.groupby(by='release_year').sum()['popularity'].plot();
The above plot show that the popularity is increasing from year to year. Thus, we can't use it for comparing popularity between years. A better way for comparing genre's popularity is to use count of each genre.
fig, ax = plt.subplots(10, 6)
fig.set_figheight(35)
fig.set_figwidth(30)
for col in range(0, len(genre_year.columns) - 1):
genre_year.plot(y=genre_year.columns[col], kind="bar", ax=ax[int(col / 6), col % 6])
For the last 12 years the most frequent genre is Drama followed by Comedy
plt.bar(genre_year.index.tolist(), genre_year_popularity[2015])
plt.bar(genre_year.index.tolist(), genre_year[2015], alpha=0.5)
plt.title("Popularity vs Count for genre iin 2015")
plt.xlabel("Genre")
plt.ylabel("Popularity/Count");
We can see that in 2015 the Drama is the most popular genre and also has the most number of movies produced. On the other side, we can see that action movies is the second popular genre but it is positioned the forth with respect to production count. So that we can see that the genre popularity is not always the motivation for producers.
plt.scatter(df[df.release_year==2015]['revenue'], df[df.release_year==2015]['budget'])
plt.title("Budget relation with revenue")
plt.xlabel("Revenue")
plt.ylabel("Budget");
Popularity is weakly corelated to revenue as it depends more on votes. This could show that popularity and votes are not the only source of revenu.
plt.scatter(df[df.release_year==2015]['revenue'], df[df.release_year==2015]['vote_count'])
plt.scatter(df[df.release_year==2015]['revenue'], df[df.release_year==2015]['popularity'])
plt.xlabel("Revenue");
It is obvious that the budget, vote_count, and popularity directly propotional to revenue in most cases. Let's take more insights by looking at the maximus revenue movie
df[df.revenue==df.revenue.max()]
| Value | Max value | |
|---|---|---|
| budget | 237000000 | 4.250000e+08 |
| vote_count | 8458 | 9767 |
| popularity | 9.432768 | 32.985763 |
The table above compare the values for maximum revenue movie with the absolutely maximum values of the three dependent variables we are analysing. The result confirm the correlation coefficient by example as vote_count is the strongest correlated to revenue and the nearest value to maximum.